1 Preliminaries

1.2 SNP data: genotyping of various populations

1.2.1 Data description

A single-nucleotide polymorphism (SNP) is a substitution of a single nucleotide at a specific position in the genome, each variant being present in the population at a level of 0.5%. SNPs are coded as 0, 1 or 2, meaning that 0, 1 or 2 alleles differ from the reference.
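
As a small illustration of this coding, the sketch below counts the non-reference alleles in a few made-up genotypes. The genotype strings, the reference allele and the helper `count_alt` are hypothetical and only show how the values 0, 1 and 2 arise.

```r
# Hypothetical genotypes at one SNP for four individuals (two alleles each)
genotypes <- c(ind1 = "AA", ind2 = "AG", ind3 = "GG", ind4 = "GA")
ref_allele <- "A"  # assumed reference allele

# Count how many of the two alleles differ from the reference
count_alt <- function(genotype, ref) {
  alleles <- strsplit(genotype, "")[[1]]
  sum(alleles != ref)
}

coded <- vapply(genotypes, count_alt, integer(1), ref = ref_allele)
print(coded)
```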

See the great Wikipedia page for details!

SNPs can be measured for individuals with high-throughput technologies such as SNP arrays. SNP chips for humans contain more than 1 million variables! Here we only analyse a sample of a data set containing the 5500 most variable SNPs for 728 individuals of various origins, described by the following labels:

  • CEU: Utah residents with Northern and Western European ancestry from the CEPH collection
  • GIH: Gujarati Indians in Houston, Texas
  • LWK: Luhya in Webuye, Kenya
  • MKK: Maasai in Kinyawa, Kenya
  • TSI: Toscani in Italia
  • YRI: Yoruba in Ibadan, Nigeria

The data are imported as follows:

The first column is a categorical variable describing the origin of each individual, using the acronyms detailed above.

1.2.2 Questions

  1. Fit a PCA on these data. Justify the scaling or not.
  2. Represent the scree plot for the first 100 axes. Comment.
  3. Plot individual factor maps on axes 1, 2 and 3. Add colors and ellipses associated with the origin.
  4. Identify the, say, 250 individuals that contribute most to these axes. Show them in the projection. Have a look at the squared cosine (cos2) of the most influential individuals.
  5. Plot the correlation circle. Only retain variables based on the quality of their representation and/or their degree of contribution to the axes represented.
  6. Summarize the above analyses in biplots. Add fancy colors.
  7. Declare a group of individuals as supplementary (i.e., not used to fit the PCA). Show how excluding a group influences (or not) the fit and the projection, by exploring several groups. Explain.

1.2.3 Solution (compulsive click forbidden!)

  1. Fit a PCA on these data. Justify the scaling or not.

I do not scale, since SNP values are supposed to live on the same scale (values in \(\{0, 1, 2\}\)).
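
A minimal sketch of such a fit, mirroring the `PCA(snp, quali.sup = 1, scale.unit = FALSE, graph = FALSE, ...)` call from FactoMineR that appears in the warning printed in this section. Since the real `snp` table is not reproduced here, the code runs on a small simulated stand-in: the toy genotypes, the group labels `A`, `B`, `C` and the reduced `ncp` are assumptions for illustration only.

```r
library(FactoMineR)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs in {0, 1, 2}
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)  # group-specific allele frequencies
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

# Unscaled PCA: the first column is kept as a supplementary qualitative variable
res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE, ncp = 10)

# Without scaling, SNPs with a larger variance weigh more in the fit
summary(apply(geno, 2, var))
head(res$eig)
```

Setting `scale.unit = TRUE` instead would give every SNP unit variance, which puts rare and common variants on an equal footing; whether that is desirable depends on the question asked.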

## Warning in PCA(snp, quali.sup = 1, scale.unit = FALSE, graph = FALSE, ncp =
## 500): Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package

  2. Represent the scree plot for the first 100 axes. Comment.

The first axes are more informative than the others, but the information is otherwise spread fairly evenly over many axes.
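
One possible way to draw such a scree plot is sketched below with base graphics, reading the percentages of variance from the `eig` element returned by `PCA`. The simulated `toy` data (an assumption, standing in for `snp`) has fewer than 100 axes, so the number of bars is capped accordingly.

```r
library(FactoMineR)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE)

# Keep at most the first 100 axes; column 2 of `eig` is the percentage of variance
n_axes <- min(100, nrow(res$eig))
pct <- res$eig[seq_len(n_axes), 2]

barplot(pct, names.arg = seq_len(n_axes),
        xlab = "Axis", ylab = "% of variance", main = "Scree plot")

# Cumulative percentage explained by the first few axes
round(cumsum(pct)[1:5], 1)
```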

  3. Plot individual factor maps on axes 1, 2 and 3. Add colors and ellipses associated with the origin.

The arguments habillage and col.ind have the same effect here, but the first will be more useful later.

It is impressive how well the populations are separated!
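
The factor maps could be produced along the following lines. This sketch assumes the factoextra package (its `fviz_pca_ind` function has both a `habillage` and a `col.ind` argument) and again uses simulated data in place of `snp`; the toy groups are hypothetical.

```r
library(FactoMineR)
library(factoextra)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE)

# Individuals on axes (1, 2) and (1, 3), colored by origin, one ellipse per group
p12 <- fviz_pca_ind(res, axes = c(1, 2), geom = "point",
                    habillage = toy$origin, addEllipses = TRUE)
p13 <- fviz_pca_ind(res, axes = c(1, 3), geom = "point",
                    habillage = toy$origin, addEllipses = TRUE)
print(p12)
print(p13)
```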

  4. Identify the, say, 250 individuals that contribute most to these axes. Show them in the projection. Have a look at the squared cosine (cos2) of the most influential individuals.
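
A possible approach, sketched on simulated data (an assumption, standing in for `snp`): contributions and squared cosines are read from `res$ind$contrib` and `res$ind$cos2`, and factoextra's `select.ind` argument restricts the plot to the top contributors. The toy data only has 90 individuals, so the top 20 are used here instead of 250.

```r
library(FactoMineR)
library(factoextra)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE)

# Contributions are percentages: they sum to 100 on each axis
colSums(res$ind$contrib)

# The 20 individuals contributing most to axis 1, and their cos2 on axes 1 to 3
top <- order(res$ind$contrib[, 1], decreasing = TRUE)[1:20]
round(res$ind$cos2[top, 1:3], 2)

# Show only the top contributors in the projection
p_top <- fviz_pca_ind(res, geom = "point", select.ind = list(contrib = 20))
print(p_top)
```

A high contribution with a low cos2 would indicate an individual that pulls on the axis while being poorly represented in the plane.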

  5. Plot the correlation circle. Only retain variables based on the quality of their representation and/or their degree of contribution to the axes represented.
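
With several thousand SNPs the variable plot is unreadable unless it is filtered. The sketch below, on simulated data (an assumption), uses the `select.var` argument of factoextra's `fviz_pca_var` to keep either the top contributors or the best-represented variables; the cut-off of 15 is arbitrary.

```r
library(FactoMineR)
library(factoextra)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE)

# Keep the 15 variables contributing most to axes 1 and 2
p_contrib <- fviz_pca_var(res, axes = c(1, 2), select.var = list(contrib = 15))

# Alternative filter: the 15 variables with the highest cos2
p_cos2 <- fviz_pca_var(res, axes = c(1, 2), select.var = list(cos2 = 15))

print(p_contrib)
print(p_cos2)
```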

  6. Summarize the above analyses in biplots. Add fancy colors.

This is just an example; you can do better or differently!
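
One way such a biplot could be put together, assuming factoextra's `fviz_pca_biplot` and simulated data in place of `snp`. The palette name and the number of displayed variables are arbitrary choices.

```r
library(FactoMineR)
library(factoextra)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

res <- PCA(toy, quali.sup = 1, scale.unit = FALSE, graph = FALSE)

# Individuals as points colored by origin, plus the 10 most contributing SNPs
p_bi <- fviz_pca_biplot(res, geom.ind = "point",
                        habillage = toy$origin, addEllipses = TRUE,
                        select.var = list(contrib = 10),
                        col.var = "grey30", palette = "Dark2")
print(p_bi)
```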

  7. Declare a group of individuals as supplementary (i.e., not used to fit the PCA). Show how excluding a group influences (or not) the fit and the projection, by exploring several groups. Explain.

Depending on the proximity of the group to the cloud and to some particular existing groups, the fit is more or less altered.
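
The warnings in this section show calls of the form `PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "MKK"), ...)`. The sketch below applies the same `ind.sup` idea to simulated data (an assumption) and compares the fit with and without one group. To keep it minimal, the origin column is left out of the fit, and the comparison through the correlation of first-axis variable coordinates is just one possible summary.

```r
library(FactoMineR)

# Toy stand-in for `snp` (assumption): 90 individuals, 3 groups, 200 SNPs
set.seed(42)
origin <- factor(rep(c("A", "B", "C"), each = 30))
freq <- matrix(runif(3 * 200, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(90 * 200, size = 2, prob = freq[as.integer(origin), ]), nrow = 90)
toy <- data.frame(origin, geno)

# Rows of the group treated as supplementary (projected, not used in the fit)
sup_rows <- which(toy$origin == "C")

res_all <- PCA(toy[, -1], scale.unit = FALSE, graph = FALSE)
res_sup <- PCA(toy[, -1], ind.sup = sup_rows, scale.unit = FALSE, graph = FALSE)

# Percentage of variance on the first three axes, with and without group C
round(cbind(all = res_all$eig[1:3, 2], without_C = res_sup$eig[1:3, 2]), 1)

# How similar is the first axis between the two fits? (sign is arbitrary)
abs(cor(res_all$var$coord[, 1], res_sup$var$coord[, 1]))

# Supplementary individuals still get coordinates on the fitted axes
dim(res_sup$ind$coord)
dim(res_sup$ind.sup$coord)
```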

## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "MKK"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package

## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "TSI"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package

## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "GIH"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package

1.3 MNIST data

1.3.1 Data description

MNIST stands for Modified National Institute of Standards and Technology. The dataset contains 60,000 small square 28×28 pixel grayscale training images of handwritten single digits between 0 and 9 (plus 10,000 test images). It is commonly used for training various image processing systems, and is also widely used for training and testing in the field of machine learning.

1.3.2 Questions

1.3.3 Solution (compulsive click forbidden!)